Abstract

We describe a slightly sub-exponential time algorithm for learning parity functions in the 
presence of random classification noise. This results in a polynomial-time algorithm for the case 
of parity functions that depend on only the first O(log n log log n) bits of input. This is the first
known instance of an efficient noise-tolerant algorithm for a concept class that is provably not 
learnable in the Statistical Query model of Kearns ||. Thus, we demonstrate that the set of 
problems learnable in the statistical query model is a strict subset of those problems learnable 
in the presence of noise in the PAC model. 

In coding-theory terms, what we give is a poly(n)-time algorithm for decoding linear k × n
codes in the presence of random noise for the case of k = c log n log log n for some c > 0. (The
case of k = O(log n) is trivial since one can just individually check each of the $2^k$ possible
messages and choose the one that yields the closest codeword.)

A natural extension of the statistical query model is to allow queries about statistical properties
that involve t-tuples of examples (as opposed to single examples). The second result of this
paper is to show that any class of functions learnable (strongly or weakly) with t-wise queries for
t = O(log n) is also weakly learnable with standard unary queries. Hence this natural extension
to the statistical query model does not increase the set of weakly learnable functions. 

1 Introduction 

An important question in the study of machine learning is: "What kinds of functions can be learned 
efficiently from noisy, imperfect data?" The statistical query (SQ) framework of Kearns || was 
designed as a useful, elegant model for addressing this issue. The SQ model provides a restricted 
interface between a learning algorithm and its data, and has the property that any algorithm 
for learning in the SQ model can automatically be converted to an algorithm for learning in the 
presence of random classification noise in the standard PAC model. (This result has been extended 
to more general forms of noise as well |5[ ||].) The importance of the Statistical Query model is 
attested to by the fact that before its introduction, there were only a few provably noise-tolerant 
learning algorithms, whereas now it is recognized that a large number of learning algorithms can 
be formulated as SQ algorithms, and hence can be made noise-tolerant. 

The importance of the SQ model has led to the open question of whether examples exist of
problems learnable with random classification noise in the PAC model but not learnable by statistical
queries. This is especially interesting because one can characterize information-theoretically
(i.e., without complexity assumptions) what kinds of problems can be learned in the SQ model.
For example, the class of parity functions, which can be learned efficiently from non-noisy 
data in the PAC model, provably cannot be learned efficiently in the SQ model under the uniform 
distribution. Unfortunately, there is also no known efficient non-SQ algorithm for learning them
in the presence of noise (this is closely related to the classic coding-theory problem of decoding
random linear codes).

In this paper, we describe a polynomial-time algorithm for learning the class of parity functions 
that depend on only the first O(log n log log n) bits of input, in the presence of random classification
noise (of a constant noise rate). This class provably cannot be learned in the SQ model, and thus is 
the first known example of a concept class learnable with noise but not via statistical queries. Our 
algorithm has recently been shown to have applications to the problem of determining the shortest 
lattice vector length || and to various other analyses of statistical queries 0] . 

An equivalent way of stating this result is that we are given a random k × n boolean matrix
A, as well as an n-bit vector $\tilde{y}$ produced by multiplying A by an (unknown) k-bit message x, and
then corrupting each bit of the resulting codeword y = xA with probability η < 1/2. Our goal
is to recover y in time poly(n). For this problem, the case of k = O(log n) is trivial because one
could simply try each of the $2^k$ possible messages and output the nearest codeword found. Our
algorithm works for k = c log n log log n for some c > 0. The algorithm does not actually need A
to be random, so long as the noise is random and there is no other codeword within distance o(n)
from the true codeword y.

Our algorithm can also be viewed as a slightly sub-exponential time algorithm for learning 
arbitrary parity functions in the presence of noise. For this problem, the brute-force algorithm 
would draw O(n) labeled examples, and then search through all $2^n$ parity functions to find the
one of least empirical error. (A standard argument can be used to say that with high probability,
the correct function will have the lowest empirical error.) In contrast, our algorithm runs in time
$2^{O(n/\log n)}$, though it also requires $2^{O(n/\log n)}$ labeled examples. This improvement is small but
nonetheless sufficient to achieve the desired separation result. 

The second result of this paper concerns a k-wise version of the Statistical Query model. In the
standard version, algorithms may only ask about statistical properties of single examples. (E.g.,
what is the probability that a random example is labeled positive and has its first bit equal to
1?) In the k-wise version, algorithms may ask about properties of k-tuples of examples. (E.g.,
what is the probability that two random examples have an even dot-product and have the same
label?) Given the first result of this paper, it is natural to ask whether allowing k-wise queries,
for some small value of k, might increase the set of SQ-learnable functions. What we show is that
for k = O(log n), any concept class learnable from k-wise queries is also (weakly) learnable from
unary queries. Thus the seeming generalization of the SQ model to allow for O(log n)-wise queries
does not close the gap we have demonstrated between what is efficiently learnable in the SQ and
noisy-PAC models. Note that this result is the best possible with respect to k because the results
of 0] imply that for k = ω(log n), there are concept classes learnable from k-wise queries but not
unary queries. On the other hand, ω(log n)-wise queries are in a sense less interesting because it is
not clear whether they can in general be simulated in the presence of noise.

1.1 Main ideas 

The standard way to learn parity functions without noise is based on the fact that if an example 
can be written as a sum (mod 2) of previously-seen examples, then its label must be the sum (mod 
2) of those examples' labels. So, once one has found a basis, one can use that to deduce the label of 
any new example (or, equivalently, use Gaussian elimination to produce the target function itself). 
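
The noise-free procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: examples are assumed to be encoded as Python integers (bit i of the integer is coordinate i of the vector), and the function names `parity` and `learn_parity_noiseless` are hypothetical.

```python
import random


def parity(x, c):
    """Return x . c (mod 2) for bit vectors encoded as integers."""
    return bin(x & c).count("1") & 1


def learn_parity_noiseless(examples):
    """Gaussian elimination over GF(2) on noise-free labeled examples.

    examples: iterable of (x, label) with label = parity(x, target).
    Returns a vector consistent with every example (free coordinates are
    set to 0), or None if the labels are inconsistent.
    """
    pivots = {}  # highest set bit of a reduced row -> (row, label)
    for x, label in examples:
        while x:
            p = x.bit_length() - 1
            if p not in pivots:
                pivots[p] = (x, label)   # new basis vector
                break
            row, row_label = pivots[p]
            x ^= row                     # subtract the basis vector ...
            label ^= row_label           # ... and its label, mod 2
        else:
            # x reduced to the zero vector: its label must be 0
            if label:
                return None
    c = 0
    for p in sorted(pivots):             # back-substitute, lowest pivot first
        row, label = pivots[p]
        lower = row & ~(1 << p)
        if label ^ parity(lower, c):
            c |= 1 << p
    return c
```

With noisy labels the same elimination still runs, but each reduced row sums many labels, which is exactly the failure mode discussed next.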

In the presence of noise, this method breaks down. If the original data had noise rate 1/4,
say, then the sum of s labels has noise rate $1/2 - (1/2)^{s+1}$. This means we can add together only
O(log n) examples if we want the resulting sum to be correct with probability 1/2 + 1/poly(n).
Thus, if we want to use this kind of approach, we need some way to write a new test example as a
sum of only a small number of training examples.

Let us now consider the case of parity functions that depend on only the first k = log n log log n
bits of input. Equivalently, we can think of all examples as having the remaining n - k bits equal to
0. Gaussian elimination will in this case allow us to write our test example as a sum of k training
examples, which is too many. Our algorithm will instead write it as a sum of k/log k = O(log n)
examples, which gives us the desired noticeable bias (that can then be amplified).

Notice that if we have seen poly(n) training examples (and, say, each one was chosen uniformly
at random), we can argue existentially that for k = log n log log n, one should be able to write
any new example as a sum of just O(log log n) training examples, since there are $n^{O(\log\log n)} > 2^k$
subsets of this size (and the subsets are pairwise independent). So, while our algorithm is finding
a smaller subset than Gaussian elimination, it is not doing the best possible. If one could achieve, say,
a constant-factor approximation to the problem "given a set of vectors, find the smallest subset
that sums to a given target vector" then this would yield an algorithm to efficiently learn the class
of parity functions that depend on the first $k = O(\log^2 n)$ bits of input. Equivalently, this would
allow one to learn parity functions over n bits in time $2^{O(\sqrt{n})}$, compared to the $2^{O(n/\log n)}$ time of
our algorithm.

2 Definitions and Preliminaries 

A concept is a boolean function on an input space, which in this paper will generally be $\{0,1\}^n$. A
concept class is a set of concepts. We will be considering the problem of learning a target concept 
in the presence of random classification noise [jl]]. In this model, there is some fixed (known or 
unknown) noise rate η < 1/2, a fixed (known or unknown) probability distribution D over $\{0,1\}^n$,
and an unknown target concept c. The learning algorithm may repeatedly "press a button" to
request a labeled example. When it does so, it receives a pair (x, ℓ), where x is chosen from $\{0,1\}^n$
according to D and ℓ is the value c(x), but "flipped" with probability η. (I.e., ℓ = c(x) with
probability 1 - η, and ℓ = 1 - c(x) with probability η.) The goal of the learning algorithm is to find
an ε-approximation of c: that is, a hypothesis function h such that $\Pr_{x \leftarrow D}[h(x) = c(x)] \ge 1 - \epsilon$.

We say that a concept class C is efficiently learnable in the presence of random classification
noise under distribution D if there exists an algorithm A such that for any ε > 0, δ > 0, η < 1/2,
and any target concept c ∈ C, the algorithm A with probability at least 1 - δ produces an
ε-approximation of c when given access to D-random examples which have been labeled by c and
corrupted by noise of rate η. Furthermore, A must run in time polynomial in n, 1/ε, and 1/δ.¹

A parity function c is defined by a corresponding vector $c \in \{0,1\}^n$; the parity function is then
given by the rule c(x) = x · c (mod 2). We say that c depends on only the first k bits of input if
all nonzero components of c lie in its first k bits. So, in particular, there are $2^k$ distinct parity
functions that depend on only the first k bits of input. Parity functions are especially interesting
to consider under the uniform distribution D, because under that distribution parity functions are
pairwise uncorrelated.
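
As a concrete picture of the noise model just defined, the following sketch simulates the "press a button" oracle for a parity target under the uniform distribution. The integer encoding of bit vectors and the names `parity` and `noisy_example` are assumptions made for this illustration only.

```python
import random


def parity(x, c):
    """c(x) = x . c (mod 2), with bit vectors encoded as integers."""
    return bin(x & c).count("1") & 1


def noisy_example(c, n, eta, rng):
    """Draw one labeled example (x, label) for the parity target c.

    x is uniform over {0,1}^n; the label equals c(x), flipped
    independently with probability eta (random classification noise).
    """
    x = rng.getrandbits(n)
    label = parity(x, c)
    if rng.random() < eta:
        label ^= 1
    return x, label
```

For instance, `noisy_example(0b1101, 4, 0.25, random.Random(0))` returns a 4-bit example whose label is wrong about a quarter of the time over repeated draws.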



¹ Normally, one would also require polynomial dependence on 1/(1/2 - η), in part because normally this is easy
to achieve (e.g., it is achieved by any statistical query algorithm). Our algorithms run in polynomial time for any
fixed η < 1/2, but have a super-polynomial dependence on 1/(1/2 - η).

2.1 The Statistical Query model

The Statistical Query (SQ) model can be viewed as providing a restricted interface between the 
learning algorithm and the source of labeled examples. In this model, the learning algorithm may 
only receive information about the target concept through statistical queries. A statistical query 
is a query about some property Q of labeled examples (e.g., that the first two bits are equal and 
the label is positive), along with a tolerance parameter τ ∈ [0,1]. When the algorithm asks a
statistical query (Q, τ), it is asking for the probability that predicate Q holds true for a random
correctly-labeled example, and it receives an approximation of this probability up to ±τ. In other
words, the algorithm receives a response $\hat{P}_Q \in [P_Q - \tau, P_Q + \tau]$, where $P_Q = \Pr_{x \leftarrow D}[Q(x, c(x))]$.
We also require each query Q to be polynomially evaluable (that is, given (x, ℓ), we can compute
Q(x, ℓ) in polynomial time).

Notice that a statistical query can be simulated by drawing a large sample of data and computing
an empirical average, where the size of the sample would be roughly $O(1/\tau^2)$ if we wanted to assure
an accuracy of τ with high probability.
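
The simulation by sampling can be sketched as follows for noise-free data. The sample size comes from a standard Hoeffding bound, which is one way to make "roughly $O(1/\tau^2)$" concrete; the function name `simulate_sq`, the callback `draw_example`, and the confidence parameter `delta` are assumptions introduced for this sketch.

```python
import math


def simulate_sq(Q, draw_example, tau, delta=0.01):
    """Estimate P_Q = Pr[Q(x, c(x))] to within +/- tau.

    Q:            predicate taking (x, label) and returning True/False.
    draw_example: zero-argument function returning a correctly labeled
                  pair (x, label).
    By Hoeffding's inequality, m >= ln(2/delta) / (2 tau^2) samples make
    the empirical average tau-accurate with probability >= 1 - delta.
    """
    m = math.ceil(math.log(2 / delta) / (2 * tau ** 2))
    hits = sum(1 for _ in range(m) if Q(*draw_example()))
    return hits / m
```

Handling labels corrupted by classification noise needs an additional correction step, which this sketch does not attempt.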

A concept class C is learnable from statistical queries with respect to distribution D if there is
a learning algorithm A such that for any c ∈ C and any ε > 0, A produces an ε-approximation
of c from statistical queries; furthermore, the running time, the number of queries asked, and the
inverse of the smallest tolerance used must be polynomial in n and 1/ε.

We will also want to talk about weak learning. An algorithm A weakly learns a concept class
C if for any c ∈ C and for some ε < 1/2 - 1/poly(n), A produces an ε-approximation of c. That
is, an algorithm weakly learns if it can do noticeably better than guessing.

The statistical query model is defined with respect to non-noisy data. However, statistical 
queries can be simulated from data corrupted by random classification noise ||. Thus, any concept 
class learnable from statistical queries is also PAC-learnable in the presence of random classification 
noise. There are several variants to the formulation given above that improve the efficiency of the 
simulation [g, but they are all polynomially related. 

One technical point: we have defined statistical query learnability in the "known distribution" 
setting (algorithm A knows distribution D); in the "unknown distribution" setting, A is allowed to
ask for random unlabeled examples from the distribution D. This prevents certain trivial exclusions
from what is learnable from statistical queries. 

2.2 An information-theoretic characterization 

BFJKMR [Q] prove that any concept class containing more than polynomially many pairwise
uncorrelated functions cannot be learned even weakly in the statistical query model. Specifically, they
show the following. 

Definition 1 (Def. 2 of BFJKMR) For concept class C and distribution D, the statistical query dimension
SQ-DIM(C, D) is the largest number d such that C contains d concepts $c_1, \ldots, c_d$ that are nearly
pairwise uncorrelated: specifically, for all i ≠ j,

$$\left| \Pr_{x \leftarrow D}[c_i(x) = c_j(x)] - \Pr_{x \leftarrow D}[c_i(x) \ne c_j(x)] \right| \le 1/d^3.$$



Theorem 1 (Thm. 12 of BFJKMR) In order to learn C to error less than $1/2 - 1/d^3$ in the SQ model,
where d = SQ-DIM(C, D), either the number of queries or 1/τ must be at least $\frac{1}{2} d^{1/3}$.

Note that the class of parity functions over $\{0,1\}^n$ that depend on only the first O(log n log log n)
bits of input contains $n^{O(\log\log n)}$ functions, all pairs of which are uncorrelated with respect to the
uniform distribution. Thus, this class cannot be learned (even weakly) in the SQ model with
polynomially many queries of 1/poly(n) tolerance. But we will now show that there nevertheless exists
a polynomial-time PAC-algorithm for learning this class in the presence of random classification
noise.
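
The claim that distinct parity functions are pairwise uncorrelated under the uniform distribution can be checked exhaustively for small k. The sketch below computes the quantity inside the absolute value in Definition 1 by enumeration; the function names and the integer encoding of vectors are assumptions of this illustration.

```python
def parity(x, c):
    """x . c (mod 2), with bit vectors encoded as integers."""
    return bin(x & c).count("1") & 1


def correlation(c1, c2, k):
    """Pr[c1(x) = c2(x)] - Pr[c1(x) != c2(x)] for x uniform over {0,1}^k."""
    total = 1 << k
    agree = sum(1 for x in range(total) if parity(x, c1) == parity(x, c2))
    return (agree - (total - agree)) / total
```

For any two distinct vectors `c1` and `c2` the result is exactly 0, because the two parities differ precisely when x has odd overlap with the nonzero vector `c1 ^ c2`, which happens for half of all inputs.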

3 Learning Parity with Noise 

3.1 Learning over the uniform distribution 

For ease of notation, we use the "length-k parity problem" to denote the problem of learning a
parity function over $\{0,1\}^k$, under the uniform distribution, in the presence of random classification
noise of rate η.

Theorem 2 The length-k parity problem, for noise rate η equal to any constant less than 1/2, can
be solved with number of samples and total computation-time $2^{O(k/\log k)}$.

Thus, in the presence of noise we can learn parity functions over $\{0,1\}^n$ in time and
sample size $2^{O(n/\log n)}$, and we can learn parity functions over $\{0,1\}^n$ that only depend on the first
k = O(log n log log n) bits of the input in time and sample size poly(n).

We begin our proof of Theorem 2 with a simple lemma about how noise becomes amplified
when examples are added together. For convenience, if $x_1$ and $x_2$ are examples, we let $x_1 + x_2$
denote the vector sum mod 2; similarly, if $\ell_1$ and $\ell_2$ are labels, we let $\ell_1 + \ell_2$ denote their sum mod
2.

Lemma 3 Let $(x_1, \ell_1), \ldots, (x_s, \ell_s)$ be examples labeled by c and corrupted by random noise of rate
η. Then $\ell_1 + \cdots + \ell_s$ is the correct value of $(x_1 + \cdots + x_s) \cdot c$ with probability $\frac{1}{2} + \frac{1}{2}(1 - 2\eta)^s$.

Proof. Clearly true when s = 1. Now assume that the lemma is true for s - 1. Then the probability
that $\ell_1 + \cdots + \ell_s = (x_1 + \cdots + x_s) \cdot c$ is

$$(1 - \eta)\left(\tfrac{1}{2} + \tfrac{1}{2}(1 - 2\eta)^{s-1}\right) + \eta\left(\tfrac{1}{2} - \tfrac{1}{2}(1 - 2\eta)^{s-1}\right) = \tfrac{1}{2} + \tfrac{1}{2}(1 - 2\eta)^s.$$

The lemma then follows by induction.
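
The lemma can be sanity-checked numerically. The sketch below computes the probability that an even number of the s labels are flipped by brute-force enumeration over all flip patterns, and compares it with the closed form; both function names are hypothetical, and the enumeration is only practical for small s.

```python
from itertools import product


def prob_sum_correct(eta, s):
    """Exact probability that an even number of s independent labels flip."""
    total = 0.0
    for flips in product((0, 1), repeat=s):
        w = sum(flips)
        if w % 2 == 0:
            total += eta ** w * (1 - eta) ** (s - w)
    return total


def closed_form(eta, s):
    """The expression from the lemma: 1/2 + (1/2)(1 - 2 eta)^s."""
    return 0.5 + 0.5 * (1 - 2 * eta) ** s
```

For η = 1/4 and s = 3 both functions give 0.5625, so the advantage over random guessing has already shrunk to 1/16.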

The idea for the algorithm is that by drawing many more examples than the minimum needed
to learn information-theoretically, we will be able to write basis vectors such as (1, 0, ..., 0) as
the sum of a relatively small number of training examples, substantially smaller than the number
that would result from straightforward Gaussian elimination. In particular, for the length
O(log n log log n) parity problem, we will be able to write (1, 0, ..., 0) as the sum of only O(log n)
examples. By Lemma 3, this means that, for any constant noise rate η < 1/2, the corresponding
sum of labels will be polynomially distinguishable from random. Hence, by repeating this process
as needed to boost reliability, we may determine the correct label for (1, 0, ..., 0), which is
equivalently the first bit of the target vector c. This process can be further repeated to determine the
remaining bits of c, allowing us to recover the entire target concept with high probability.

To describe the algorithm for the length-k parity problem, it will be convenient to view each
example as consisting of a blocks, each b bits long (so, k = ab) where a and b will be chosen later.
We then introduce the following notation.

Definition 2 Let $V_i$ be the subspace of $\{0,1\}^{ab}$ consisting of those vectors whose last i blocks have
all bits equal to zero. An i-sample of size s is a set of s vectors independently and uniformly
distributed over $V_i$.

The goal of our algorithm will be to use labeled examples from $\{0,1\}^{ab}$ (these form a 0-sample) to
create an i-sample such that each vector in the i-sample can be written as a sum of at most $2^i$ of
the original examples, for all i = 1, 2, ..., a - 1. We attain this goal via the following lemma.

Lemma 4 Assume we are given an i-sample of size s. We can in time O(s) construct an (i + 1)-sample
of size at least $s - 2^b$ such that each vector in the (i + 1)-sample is written as the sum of
two vectors in the given i-sample.

Proof. Let the i-sample be $x_1, \ldots, x_s$. In these vectors, blocks a - i + 1, ..., a are all zero. Partition
$x_1, \ldots, x_s$ based on their values in block a - i. This results in a partition having at most $2^b$ classes.
From each nonempty class p, pick one vector $x_{j_p}$ at random and add it to each of the other vectors
in its class; then discard $x_{j_p}$. The result is a collection of vectors $u_1, \ldots, u_{s'}$, where $s' \ge s - 2^b$
(since we discard at most one vector per class).

What can we say about $u_1, \ldots, u_{s'}$? First of all, each $u_j$ is formed by summing two vectors in
$V_i$ which have identical components throughout block a - i, "zeroing out" that block. Therefore,
$u_j$ is in $V_{i+1}$. Secondly, each $u_j$ is formed by taking some $x_j$ and adding to it a random vector in
$V_i$, subject only to the condition that the random vector agrees with $x_j$ on block a - i. Therefore,
each $u_j$ is an independent, uniform-random member of $V_{i+1}$. The vectors $u_1, \ldots, u_{s'}$ thus form the
desired (i + 1)-sample.
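
One application of the lemma can be sketched as below. The encoding is an assumption of this illustration: a vector is an integer of a·b bits in which block 1 occupies the b lowest-order bits and block a the b highest-order bits, so "last i blocks zero" means the top i·b bits are zero. Each vector travels with its label so that sums of vectors carry the corresponding sums of labels; the name `reduce_step` is hypothetical.

```python
import random


def reduce_step(sample, i, a, b, rng):
    """Turn an i-sample into an (i+1)-sample by pairwise sums.

    sample: list of (vector, label) pairs whose last i blocks are zero.
    Returns at least len(sample) - 2**b pairs whose last i+1 blocks are
    zero; each output is the sum (XOR) of two input pairs.
    """
    shift = (a - i - 1) * b            # bit offset of block a - i
    mask = (1 << b) - 1
    classes = {}                       # value of block a - i -> members
    for vec, label in sample:
        classes.setdefault((vec >> shift) & mask, []).append((vec, label))
    out = []
    for members in classes.values():
        pick = rng.randrange(len(members))   # representative, later discarded
        pv, pl = members[pick]
        for j, (vec, label) in enumerate(members):
            if j != pick:
                out.append((vec ^ pv, label ^ pl))
    return out
```

The labels are never inspected when choosing which vectors to add, which is the property the proof of Theorem 2 relies on.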

Using this lemma, we can now prove our main theorem. 

Proof of Theorem 2. Draw $a 2^b$ labeled examples. Observe that these qualify as a 0-sample.
Now apply Lemma 4, a - 1 times, to construct an (a - 1)-sample. This (a - 1)-sample will have
size at least $2^b$. Recall that the vectors in an (a - 1)-sample are distributed independently and
uniformly at random over $V_{a-1}$, and notice that $V_{a-1}$ contains only $2^b$ distinct vectors, one of which
is (1, 0, ..., 0). Hence there is an approximately 1 - 1/e chance that (1, 0, ..., 0) appears in our
(a - 1)-sample. If this does not occur, we repeat the above process with new labeled examples.
Note that the expected number of repetitions is only constant.

Now, unrolling our applications of Lemma 4, observe that we have written the vector (1, 0, ..., 0)
as the sum of $2^{a-1}$ of our labeled examples, and we have done so without examining their labels.
Thus the label noise is still random, and we can apply Lemma 3. Hence the sum of the labels gives
us the correct value of (1, 0, ..., 0) · c with probability $\frac{1}{2} + \frac{1}{2}(1 - 2\eta)^{2^{a-1}}$.

This means that if we repeat the above process using new labeled examples each time for
$\mathrm{poly}\left(\left(\frac{1}{1-2\eta}\right)^{2^a}, b\right)$ times, we can determine (1, 0, ..., 0) · c with probability of error exponentially
small in ab. In other words, we can determine the first bit of c with very high probability. And
of course, by cyclically shifting all examples, the same algorithm may be employed to find each
bit of c. Thus, with high probability we can determine c using a number of examples and total
computation-time $\mathrm{poly}\left(\left(\frac{1}{1-2\eta}\right)^{2^a}, 2^b\right)$.

Plugging in $a = \frac{1}{2} \lg k$ and $b = 2k/\lg k$ yields the desired $2^{O(k/\log k)}$ bound for constant noise
rate η.
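
The whole procedure from the proof can be put together in a small simulation. This is a sketch under stated assumptions rather than a tuned implementation: vectors are integers with block 1 in the low-order bits, the hidden target `c` is passed in only so the code can simulate the noisy example oracle, the number of votes is a fixed parameter instead of being derived from η, and all function names are hypothetical.

```python
import random


def parity(x, c):
    return bin(x & c).count("1") & 1


def reduce_step(sample, i, a, b, rng):
    """Lemma 4: zero out block a - i by adding pairs within each class."""
    shift, mask = (a - i - 1) * b, (1 << b) - 1
    classes = {}
    for vec, label in sample:
        classes.setdefault((vec >> shift) & mask, []).append((vec, label))
    out = []
    for members in classes.values():
        pick = rng.randrange(len(members))
        pv, pl = members[pick]
        out.extend((v ^ pv, l ^ pl)
                   for j, (v, l) in enumerate(members) if j != pick)
    return out


def one_vote(draw, a, b, rng):
    """One round: a noisy guess for the first bit of c, or None."""
    sample = [draw() for _ in range(a << b)]      # a * 2^b fresh examples
    for i in range(a - 1):
        sample = reduce_step(sample, i, a, b, rng)
    for vec, label in sample:
        if vec == 1:                               # the vector (1, 0, ..., 0)
            return label
    return None


def recover_bit(draw, a, b, votes, rng):
    """Majority vote over independent rounds."""
    ones = got = 0
    while got < votes:
        v = one_vote(draw, a, b, rng)
        if v is not None:
            ones += v
            got += 1
    return int(2 * ones > votes)


def learn_noisy_parity(c, a, b, eta, votes, rng):
    """Recover every bit of c by cyclically shifting the examples."""
    k = a * b
    full = (1 << k) - 1
    result = 0
    for j in range(k):
        def draw(j=j):
            x = rng.getrandbits(k)
            label = parity(x, c) ^ (rng.random() < eta)
            rotated = ((x >> j) | (x << (k - j))) & full  # bit j -> bit 0
            return rotated, label
        result |= recover_bit(draw, a, b, votes, rng) << j
    return result
```

For example, with `a=2`, `b=4`, `eta=0.1` and `votes=101`, each vote is a sum of two labels and is correct with probability 0.82 by Lemma 3, so the majority is wrong only with negligible probability.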

3.2 Extension to other distributions 

While the uniform distribution is in this case the most interesting, we can extend our algorithm
to work over any distribution. In fact, it is perhaps easiest to think of this extension as an online
learning algorithm that is presented with an arbitrary sequence of examples, one at a time. Given
a new test example, the algorithm will output either "I don't know", or else will give a prediction
of the label. In the former case, the algorithm is told the correct label, flipped with probability η.
The claim is that the algorithm will, with high probability, be correct in all its predictions, and
furthermore will output "I don't know" only a limited number of times. In the coding-theoretic
view, this corresponds to producing a 1 - o(1) fraction of the desired codeword, where the remaining
entries are left blank. This allows us to recover the full codeword so long as no other codeword is
within relative distance o(1).

The algorithm is essentially a form of Gaussian elimination, but where each entry in the matrix
is an element of the vector space $F_2^b$ rather than an element of the field $F_2$. In particular, instead
of choosing a row that begins with a 1 and subtracting it from all other such rows, what we do
is choose one row for each initial b-bit block observed: we then use these (at most $2^b - 1$) rows
to zero out all the others. We then move on to the next b-bit block. If we think of this as an
online algorithm, then each new example seen either gets captured as a new row in the matrix (and
there are at most $a(2^b - 1)$ of them) or else it passes all the way through the matrix and is given
a prediction. We then do this with multiple matrices and take a majority vote to drive down the
probability of error.

For concreteness, let us take the case of n examples, each k bits long for $k = \frac{1}{4} \lg n (\lg\lg n - 2)$,
and η = 1/4. We view each example as consisting of (lg lg n - 2) blocks, where each block has
width $\frac{1}{4} \lg n$. We now create a series of matrices $M_1, M_2, \ldots$ as follows. Initially, the matrices are
all empty. Given a new example, if its first block does not match the first block of any row in
$M_1$, we include it as a new row of $M_1$ (and output "I don't know"). If the first block does match,
then we subtract that row from it (zeroing out the first block of our example) and consider the
second block. Again, if the second block does not match any row in $M_1$ we include it as a new
row (and output "I don't know"); otherwise, we subtract that row and consider the third block
and so on. Notice that each example will either be "captured" into the matrix $M_1$ or else gets
completely zeroed out (i.e., written as a sum of rows of $M_1$). In the latter case, we have written the
example as a sum of at most $2^{\lg\lg n - 2} = \frac{1}{4} \lg n$ previously-seen examples, and therefore the sum of
their labels is correct with probability at least $\frac{1}{2}(1 + 1/n^{1/4})$. To amplify this probability, instead of
making a prediction we put the example into a new matrix $M_2$, and so on up to matrix $M_{n^{2/3}}$. If an
example passes through all matrices, we can then state that the majority vote is correct with high
probability. Since each matrix has at most $2^{\frac{1}{4} \lg n} (\lg\lg n - 2)$ rows, the total number of examples
on which we fail to make a prediction is at most $n^{11/12} \lg\lg n = o(n)$.
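
A single matrix of this online procedure can be sketched as follows. The class name `BlockMatrix`, its two-method interface, and the integer encoding (first block in the low-order bits) are assumptions of this illustration, and the sketch covers only one matrix $M_1$, not the chain of matrices with a majority vote. A block that is already zero needs no row, which is why each block position stores at most $2^b - 1$ rows.

```python
class BlockMatrix:
    """One matrix M of the online block-elimination procedure."""

    def __init__(self, a, b):
        self.a, self.b = a, b
        self.mask = (1 << b) - 1
        # rows[j] maps a nonzero value of block j to (reduced vector, label)
        self.rows = [dict() for _ in range(a)]

    def predict(self, vec):
        """Return (prediction, None), or (None, pending) for "I don't know"."""
        acc = 0                                  # sum of labels of rows used
        for j in range(self.a):
            block = (vec >> (j * self.b)) & self.mask
            if block == 0:
                continue
            row = self.rows[j].get(block)
            if row is None:
                return None, (j, vec, acc)       # example will be captured
            vec ^= row[0]                        # zero out block j
            acc ^= row[1]
        return acc, None                         # example fully zeroed out

    def learn(self, pending, noisy_label):
        """Capture a pending example once its (noisy) label is revealed."""
        j, vec, acc = pending
        block = (vec >> (j * self.b)) & self.mask
        self.rows[j][block] = (vec, noisy_label ^ acc)
```

With noise-free labels every prediction is exact; with noise, a prediction is a sum of several noisy labels and has only the small bias computed above, which is why the text feeds examples through many such matrices before voting.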

3.3 Discussion 

Theorem 2 demonstrates that we can solve the length-n parity learning problem in time $2^{o(n)}$.
However, it must be emphasized that we accomplish this by using $2^{O(n/\log n)}$ labeled examples.
From the point of view of coding theory, it would be useful to have an algorithm which takes time
$2^{o(n)}$ and number of examples poly(n) or even O(n). We do not know if this can be done. Also of
interest is the question of whether our time-bound can be improved from $2^{O(n/\log n)}$ to, for example,
$2^{O(\sqrt{n})}$.

It would also be desirable to reduce our algorithm's dependence on η. This dependence comes
from Lemma 3, with $s = 2^{a-1}$. For instance, consider the problem of learning parity functions that
depend on the first k bits of input for k = O(log n log log n). In this case, if we set $a = \lceil \frac{1}{2} \lg\lg n \rceil$
and b = O(log n), the running time is polynomial in n, with dependence on η of $\left(\frac{1}{1-2\eta}\right)^{\sqrt{\log n}}$. This
allows us to handle η as large as $1/2 - 2^{-\sqrt{\log n}}$ and still have polynomial running time. While this
can be improved slightly, we do not know how to solve the length-O(log n log log n) parity problem
in polynomial time for η as large as 1/2 - 1/n or even $1/2 - 1/n^{\epsilon}$. What makes this interesting
is that it is an open question (Kearns, personal communication) whether noise tolerance can in
general be boosted; this example suggests why such a result may be nontrivial.



4 Limits of O(log n)-wise Queries 

We return to the general problem of learning a target concept c over a space of examples with a
fixed distribution D. A limitation of the statistical query model is that it permits only what may
be called unary queries. That is, an SQ algorithm can access c only by requesting approximations
of probabilities of form $\Pr_x[Q(x, c(x))]$, where x is D-random and Q is a polynomially evaluable
predicate. A natural question is whether problems not learnable from such queries can be learned,
for example, from binary queries: i.e., from probabilities of form $\Pr_{x_1, x_2}[Q(x_1, x_2, c(x_1), c(x_2))]$.
The following theorem demonstrates that this is not possible, proving that O(log n)-wise queries
are no better than unary queries, at least with respect to weak-learning.

We assume in the discussion below that all algorithms also have access to individual unlabeled
examples from distribution D, as is usual in the SQ model.

Theorem 5 Let k = O(log n), and assume that there exists a poly(n)-time algorithm using k-wise
statistical queries which weakly learns a concept class C under distribution D. That is, this algorithm
learns from approximations of $\Pr_{\vec{x}}[Q(\vec{x}, c(\vec{x}))]$, where Q is a polynomially evaluable predicate,
and $\vec{x}$ is a k-tuple of examples. Then there exists a poly(n)-time algorithm which weakly learns the
same class using only unary queries, under D.



Proof. We are given a k-wise query $\Pr_{\vec{x}}[Q(\vec{x}, c(\vec{x}))]$. The first thing our algorithm will do is use Q
to construct several candidate weak hypotheses. It then tests whether each of these hypotheses is
in fact noticeably correlated with the target using unary statistical queries. If none of them appear
to be good, it uses this fact to estimate the value of the k-wise query. We prove that for any k-wise
query, with high probability we either succeed in finding a weak hypothesis or we output a good
estimate of the k-wise query.

For simplicity, let us assume that $\Pr_x[c(x) = 1] = 1/2$; i.e., a random example is equally likely
to be positive or negative. (If $\Pr_x[c(x) = 1]$ is far from 1/2 then weak-learning is easy by just
predicting all examples are positive or all examples are negative.) This assumption implies that if a
hypothesis h satisfies $\left| \Pr_x[h(x) = 1 \wedge c(x) = 1] - \frac{1}{2} \Pr_x[h(x) = 1] \right| > \epsilon$, then either h(x) or 1 - h(x)
is a weak hypothesis.

We now generate a set of candidate hypotheses by choosing one random k-tuple of unlabeled
examples $\vec{z}$. For each 1 ≤ i ≤ k and $\vec{\ell} \in \{0,1\}^k$, we hypothesize

$$h_{\vec{z}, \vec{\ell}, i}(x) = Q(z_1, \ldots, z_{i-1}, x, z_{i+1}, \ldots, z_k, \vec{\ell}),$$

and then use a unary statistical query to tell if $h_{\vec{z}, \vec{\ell}, i}(x)$ or $1 - h_{\vec{z}, \vec{\ell}, i}(x)$ is a weak hypothesis. As
noted above, we will have found a weak hypothesis if

$$\left| \Pr_x\left[Q(z_1, \ldots, z_{i-1}, x, z_{i+1}, \ldots, z_k, \vec{\ell}) \wedge c(x) = 1\right] - \frac{1}{2} \Pr_x\left[Q(z_1, \ldots, z_{i-1}, x, z_{i+1}, \ldots, z_k, \vec{\ell})\right] \right| > \epsilon.$$

We repeat this process for O(1/ε) randomly chosen k-tuples $\vec{z}$. We now consider two cases.

Case I: Suppose that the ith label matters to the k-wise query Q for some i and $\vec{\ell}$. By this we
mean there is at least an ε chance of the above inequality holding for random $\vec{z}$. Then with high
probability we will discover such a $\vec{z}$ and thus weak learn.

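Case II: Suppose, on the contrary, that for no i or $\vec{\ell}$ does the ith label matter, i.e. the probability
of a random $\vec{z}$ satisfying the above inequality is less than ε. This means that

$$\mathbf{E}_{\vec{z}}\left[ \left| \Pr_x\left[Q(z_1, \ldots, z_{i-1}, x, z_{i+1}, \ldots, z_k, \vec{\ell}) \wedge c(x) = 1\right] - \frac{1}{2} \Pr_x\left[Q(z_1, \ldots, z_{i-1}, x, z_{i+1}, \ldots, z_k, \vec{\ell})\right] \right| \right] < 2\epsilon.$$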



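By bucketing the $\vec{z}$'s according to the values of $c(z_1), \ldots, c(z_{i-1})$ we see that the above implies that
for all $b_1, \ldots, b_{i-1} \in \{0,1\}$,

$$\left| \Pr_{\vec{z}}\left[Q(\vec{z}, \vec{\ell}) \wedge c(z_1) = b_1 \wedge \cdots \wedge c(z_{i-1}) = b_{i-1} \wedge c(z_i) = 1\right] - \frac{1}{2} \Pr_{\vec{z}}\left[Q(\vec{z}, \vec{\ell}) \wedge c(z_1) = b_1 \wedge \cdots \wedge c(z_{i-1}) = b_{i-1}\right] \right| < 2\epsilon.$$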



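By a straightforward inductive argument on i, we conclude that for every $\vec{b} \in \{0,1\}^k$,

$$\left| \Pr_{\vec{z}}\left[Q(\vec{z}, \vec{\ell}) \wedge c(\vec{z}) = \vec{b}\right] - \frac{1}{2^k} \Pr_{\vec{z}}\left[Q(\vec{z}, \vec{\ell})\right] \right| < 4\epsilon \left(1 - \frac{1}{2^k}\right).$$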



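This fact now allows us to estimate our desired k-wise query $\Pr_{\vec{z}}[Q(\vec{z}, c(\vec{z}))]$. In particular,

$$\Pr_{\vec{z}}[Q(\vec{z}, c(\vec{z}))] = \sum_{\vec{\ell} \in \{0,1\}^k} \Pr_{\vec{z}}\left[Q(\vec{z}, \vec{\ell}) \wedge c(\vec{z}) = \vec{\ell}\right].$$

We approximate each of the $2^k$ = poly(n) terms corresponding to a different $\vec{\ell}$ by using unlabeled
data to estimate $\frac{1}{2^k} \Pr_{\vec{z}}[Q(\vec{z}, \vec{\ell})]$. Adding up these terms gives us a good estimate of $\Pr_{\vec{z}}[Q(\vec{z}, c(\vec{z}))]$
with high probability.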



4.1 Discussion 

In the above proof, we saw that either the data is statistically "homogeneous" in a way which
allows us to simulate the original learning algorithm with unary queries, or else we discover a
"heterogeneous" region which we can exploit with an alternative learning algorithm using only
unary queries. Thus any concept class that can be learned from O(log n)-wise queries can also
be weakly learned from unary queries. Note that Aslam and Decatur [2] have shown that
weak-learning statistical query algorithms can be boosted to strong-learning algorithms, if they
weak-learn over every distribution. Thus, any concept class which can be (weakly or strongly) learned
from O(log n)-wise queries over every distribution can be strongly learned over every distribution
from unary queries.

It is worth noting here that k-wise queries can be used to solve the length-k parity problem. One
simply asks, for each i ∈ {1, ..., k}, the query: "what is the probability that k random examples
form a basis for $\{0,1\}^k$ and, upon performing Gaussian elimination, yield a target concept whose
ith bit is equal to 1?" Thus, k-wise queries cannot be reduced to unary queries for k = ω(log n). On
the other hand, it is not at all clear how to simulate such queries in general from noisy examples.

5 Conclusion



In this paper we have addressed the classic problem of learning parity functions in the presence of 
random noise. We have shown that parity functions over $\{0,1\}^n$ can be learned in slightly
sub-exponential time, but only if many labeled examples are available. It is to be hoped that future
research may reduce both the time-bound and the number of examples required. 

Our result also applies to the study of statistical query learning and PAC-learning. We have 
given the first known noise-tolerant PAC-learning algorithm which can learn a concept class not 
learnable by any SQ algorithm. The separation we have established between the two models is 
rather small: we have shown that a specific parity problem can be PAC-learned from noisy data in 
time poly(n), as compared to time $n^{O(\log\log n)}$ for the best SQ algorithm. This separation may well
prove capable of improvement and worthy of further examination. Perhaps more importantly, this 
suggests the possibility of interesting new noise-tolerant PAC-learning algorithms which go beyond 
the SQ model. 

We have also examined an extension to the SQ model in terms of allowing queries of arity k. 
We have shown that for k = O(log n), any concept class learnable in the SQ model with k-wise
queries is also (weakly) learnable with unary queries. On the other hand, the results of j| imply 
this is not the case for k = ω(log n). An interesting open question is whether every concept class
learnable from O(log n log log n)-wise queries is also PAC-learnable in the presence of classification
noise. If so, then this would be a generalization of the first result of this paper. 